DataCenter Failure: downtime is not an option
Evaporated data, turbulence, hurricanes, storms: everyone picks their own cloud metaphor, precisely because we are talking about cloud computing, disaster recovery, and downtime in datacenters. We are talking about Amazon Web Services, Aruba, Google Mail, and clustered storage systems; about data and services centralized in datacenters built for exactly this job; about human errors and planning errors; and about the carelessness of customers who rely completely on a provider without having read the SLA, without having correctly architected their systems on the services the provider offers, and without having assessed the economic and reputational damage that downtime would cause, the famous business continuity.
AWS Outage
Let’s start with Amazon. As already noted in this other post, on April 21st there were problems in a datacenter in the us-east region (Virginia) affecting the fundamental EC2 and RDS services. These took down some well-known social services such as Foursquare, Reddit, and Quora.
But Netflix, Twilio, and others, while using resources in the same affected region, did not suffer the same fate, and Reddit itself came back online almost immediately thanks to a strong support contract with Amazon that guaranteed it dedicated engineers. Amazon published a summary at the end of the incident that explains what happened, how they handled the event, what they learned, and what they will do in the immediate future to prevent it from happening again; they apologize to all customers and offer service credits as a refund. Obviously, the reimbursement is not measured by the extent of the damage suffered; it depends on the type of contract you have with the provider.
EBS Storage Sync Problems
For the EC2 service, the trigger was a subset of EBS volumes in one Availability Zone: they became stuck, unable to serve reads and writes, and obviously the instances that depended on those volumes were blocked as well. Amazon disabled the control APIs for the EBS cluster in the affected zone. At this point Amazon briefly explains how the EBS service and its clusters work. As many experts will already have guessed, EBS is a distributed storage system: a set of clusters whose nodes hold consistent, block-level replicas of the data, plus a set of other systems that coordinate user requests to the nodes. Replication between cluster nodes follows a peer-to-peer fast-failover strategy: if a replica goes out of sync or becomes unavailable, a new replica is triggered elsewhere. The nodes are connected by two networks: a primary high-bandwidth network and a secondary, lower-capacity network used as a backup and overflow network for data replication. The secondary network was not designed to carry all of the primary's traffic, only to provide highly reliable connectivity between the nodes.
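To make the re-replication mechanism more concrete, here is a minimal, purely illustrative sketch (not Amazon's actual implementation) of a node that, having lost its replica peer, searches the cluster for free capacity and re-mirrors its volume there. Multiplied by thousands of nodes searching at the same time, this is the kind of behavior that can saturate an undersized network.

```python
# Toy model of peer-to-peer re-mirroring: illustrative only,
# not the real EBS implementation.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_gb: int
    used_gb: int = 0
    volumes: list = field(default_factory=list)

    def free_gb(self) -> int:
        return self.capacity_gb - self.used_gb


def re_mirror(volume_gb: int, cluster: list[Node]) -> Node | None:
    """When a replica is lost, search the cluster for a node with
    enough free space and create a new replica there."""
    for node in cluster:
        if node.free_gb() >= volume_gb:
            node.used_gb += volume_gb
            node.volumes.append(volume_gb)
            return node
    # No capacity found: the volume stays "stuck" and keeps searching,
    # generating more replication traffic.
    return None


# Toy scenario: many nodes lose their peers at once and all try to
# re-mirror simultaneously, exhausting the available capacity.
cluster = [Node("node-a", 1000), Node("node-b", 1000), Node("node-c", 200)]
lost_replicas = [300] * 8  # eight 300 GB volumes that lost their mirror
for size in lost_replicas:
    target = re_mirror(size, cluster)
    print(f"{size} GB volume re-mirrored to {target.name if target else 'nobody (stuck)'}")
```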
The problem arose from a human error during a routine scaling operation on the primary network of one EBS cluster. The change was meant to increase its capacity, but the traffic that had to be shifted during the update was mistakenly routed onto the secondary network, which could not handle it, and the replicas fell out of sync. The EBS storage problem obviously also affected the RDS relational database service, which depends on it entirely.
According to an analysis by RightScale, more than 500k EBS volumes may have been affected; RightScale also argues that an event of this magnitude exceeds the design parameters, cannot realistically be tested, and that no system of comparable scale is in operation anywhere else.
Amazon states that it will make a series of changes to improve its infrastructure and prevent this type of event from recurring.
An interesting comment came from Rackspace’s Lew Moorman in an interview with the New York Times: “Amazon’s outage is the cyber equivalent of a plane crash. It is a major episode with widespread damage. But air travel is still safer than traveling by car, just as cloud computing is safer than data centers run by individual enterprises. Every day, in companies around the world, there are technology outages; each episode is small, but in aggregate they waste far more time, money, and business.”
AWS Lessons and the Right Approaches to Using It
What can customers do to use these services correctly and ride out the provider’s technical problems? First of all, the EC2 service used on its own does not guarantee high availability; it has an SLA of 99.95%, and the same applies to RDS, which depends on EC2 and EBS. Amazon itself, however, points out that using the services correctly leads to highly reliable solutions: deploy across multiple Availability Zones (Netflix uses three), use EBS snapshots so that a volume can be recreated in another Availability Zone (the snapshot physically lives on S3), back up data to S3, use RDS backups and snapshots, or enable Multi-AZ replication (between different Availability Zones), as sketched below. These are the approaches that kept certain customers online despite the provider’s problems.
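As a concrete illustration, here is a minimal sketch of two of those precautions using the boto3 SDK (a modern tool, not what customers were using at the time); the region, volume ID, and database identifier are hypothetical placeholders.

```python
# Minimal sketch of two resilience measures described above, using boto3.
# Region, volume ID, and DB instance identifier are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
rds = boto3.client("rds", region_name="us-east-1")

# 1) Snapshot an EBS volume: the snapshot is stored on S3 and can be used
#    to recreate the volume in any Availability Zone of the region.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",            # hypothetical volume ID
    Description="backup before maintenance",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Recreate the volume in a different Availability Zone from the snapshot.
ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1b",               # a different AZ
)

# 2) Turn on Multi-AZ replication for an existing RDS instance, so a
#    synchronous standby is kept in another Availability Zone.
rds.modify_db_instance(
    DBInstanceIdentifier="my-database",          # hypothetical instance name
    MultiAZ=True,
    ApplyImmediately=True,
)
```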
Aruba
Based on Aruba’s statements in the following communication: http://ticket.aruba.it/News/212/webfarm-arezzo-aggiornamenti-3.aspx
This morning at 04:30, a short circuit inside the battery cabinets serving the UPS systems of Aruba’s Arezzo server farm caused a fire. The fire detection system kicked in immediately, shutting down the air conditioning and activating the extinguishing system in sequence. Because the smoke released by the burning plastic of the batteries completely filled the rooms of the facility, the system interpreted the persistent smoke as an ongoing fire and automatically cut off the electricity.
The UPS system is supposed to act as a backup for the main mains power supply, but a design error (human error) in the ventilation of the UPS room caused the shutdown of all systems, which for the Italian market meant millions of customer sites going offline. As Aruba itself says in the press release, this error will be corrected:
In addition, although it is customary to install batteries inside the datacenter, to avoid a repeat of what happened, from today the batteries of the Arezzo datacenter and of all the other datacenters of the Aruba Group will be installed in dedicated rooms, external to and separate from the main structure.
Google Gmail
As for the outage that hit some Gmail customers last February, as communicated in http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/it//appsstatus/ir/nfed4uv2f8xby99.pdf, it was caused by a bug inadvertently introduced in a software update (human error). To avoid data integrity problems, Google disabled access to Google Apps for the affected customers, and the engineering team had to restore the mailboxes from backup tapes, confirming that tape backups are still in use and still reliable.
Conclusions
Despite these incidents, as Lew Moorman says in the NYT interview, the large datacenters run by these large players remain safer than the solutions that small and medium-sized companies could adopt on their own.
Instead, the discussion should shift to a far more complex question, which starts from the following observation:
why do Facebook, Google, and Amazon build their own servers (Facebook and Google in particular) and datacenters (see Facebook’s OpenCompute Project), and modify or create open-source software projects for their own needs (see Google’s Bigtable, or Amazon’s S3, EBS (which seems to use DRBD) and SDB storage systems), where the beating heart is banks of ordinary but powerful servers and dedicated networks that replicate data between the many nodes, i.e. proprietary software solutions or modified open-source projects born in some university and perhaps still under development somewhere in the world?
The question is provocative for the vendors of big-iron systems (IBM, HP, Dell, etc.), but the answer may lie in some older news (Isilon’s technology, EMC buying Isilon, HP buying LeftHand, and other recent acquisitions): the big vendors only started a few years ago to understand the need to specialize in clustered, distributed storage systems, because only such systems can cope with huge amounts of data and huge demands for bandwidth and concurrent access. Moreover, the enormous needs of Amazon, Google, and Facebook tip the economic balance towards open-source or in-house solutions when compared with the licensing and support costs they would incur with the vendors of the past.
In short, most of the software and hardware behind the services we use, and will use more and more, whether or not they belong to the cloud computing paradigm, are systems that, because of their size and scope, can never be fully tested against disaster.